Abstract: In order to provide high data reliability, distributed
storage systems disperse data with redundancy to multiple
storage nodes. Regenerating codes are a new class of erasure codes
that introduce redundancy for the purpose of improving the data
repair performance in distributed storage. Most studies
on regenerating codes focus on single-failure recovery, but
it is not uncommon to see two or more node failures at the
same time in large storage networks. To exploit the opportunity
of repairing multiple failed nodes simultaneously, a cooperative
repair mechanism, in which the nodes to be repaired can
exchange data among themselves, is investigated. A lower bound
on the repair-bandwidth for cooperative repair is derived, and a
construction of a family of exact cooperative regenerating codes
matching this lower bound is presented.

Index Terms: Distributed Storage, Regenerating Codes, Erasure
Codes, Repair-Bandwidth, Network Coding.

I. Introduction 

Distributed storage systems such as Oceanstore [1] and
Total Recall [2] provide reliable and scalable solutions to the
increasing demand for data storage. They distribute data with
redundancy to multiple storage nodes, and the data can be
retrieved even if some of the nodes are not available. When erasure
coding is used as the redundancy scheme in distributed storage,
the task of repairing a node failure becomes non-trivial. A
traditional way to repair a failed node is to download and
reconstruct the whole data file first, and then regenerate the
lost content (e.g., RAID-5, RAID-6). Since the size of the
original data file may be huge, a lot of traffic is consumed for
the purpose of repairing just one failed node.

In order to reduce the total traffic required for repair,
called the repair-bandwidth, a new class of erasure codes, called
regenerating codes [3], has been proposed; these codes consume
significantly less traffic when regenerating a failed node. The main
idea of regenerating codes is to reduce the repair-bandwidth from
the surviving nodes to a new node (called a newcomer), which
regenerates the lost content of the failed node. Some
constructions of minimum repair-bandwidth regenerating codes
are given in [4], [5]. They are based on exact repair and are called
exact MBR codes, which means that the lost content of the failed
node is repaired exactly.

Most of the studies on regenerating codes in the literature
are for single-failure recovery or one-by-one repair. When
the number of storage nodes becomes large, the multi-failure
case is not infrequent, and we need to regenerate several failed
nodes at the same time. In addition, in practical systems such
as Total Recall, a recovery process is triggered only after
the total number of failed nodes has reached a predefined
threshold. These facts motivate the regeneration of multiple
failed nodes jointly, instead of repairing them one by one.
A repair process in which the newcomers may exchange
packets among themselves, called cooperative repair or
cooperative recovery, is first introduced in [6]. We will call
the regenerating codes for multiple failures with cooperative
repair cooperative regenerating codes. In [7], a special class
of cooperative regenerating codes is proposed, in which the
newcomers can flexibly select surviving nodes for repair.
In [8], an explicit construction of cooperative regenerating
codes minimizing the storage in each node is given.

The tradeoff spectrum between repair-bandwidth and storage
for cooperative regenerating codes is given in [8], [9].
Regenerating codes which attain one end of this spectrum,
corresponding to the minimum storage, are considered in [6],
[7]. In this paper, we focus on the other end of this spectrum.
Codes which minimize the repair-bandwidth are called Minimal
Repair-Bandwidth Cooperative Regenerating (MBCR) codes.

Main Results: After presenting a simple example and
demonstrating the basic ideas in Section II, we derive in
Section III a lower bound on the repair-bandwidth in cooperative
recovery. An explicit construction of a family of exact MBCR
codes matching this lower bound is given in Section IV.

II. An Illustrative Example 

In this section, we introduce some notation and illustrate
the basic idea of cooperative repair.

Based on the system model introduced in [3] and [6], a file
consisting of B packets is encoded and distributed to n nodes,
and a data collector can retrieve the file by downloading data
from any k of the n nodes. When r nodes fail, r newcomers
are selected to repair the failed nodes. The repair process
is divided into two phases. In the first phase, each of the r
newcomers connects to d surviving nodes and downloads some
packets. In the second phase, the newcomers exchange some
packets among themselves. The objective is to minimize the
total number of packets transmitted (i.e., the repair-bandwidth)
in the two phases. Next we give an illustration of cooperative
repair with parameters n = 4 and d = k = r = 2.

Fig. 1. An example of cooperative repair. The labels of the solid (resp.
dashed) arrows indicate the packets transmitted during the first (resp. second)
phase of the repair process. The content of the newcomers after the first phase
of the repair process is shown.

We initialize the distributed storage system by dividing a
data file into eight data packets A, B, ..., H, and distributing
them to four storage nodes. Each node stores five packets: four
systematic and one parity-check (Fig. 1). The addition "+" is
bit-wise exclusive-OR (XOR). The first node stores the first
four packets A, B, C, D, skips the packet E, and stores the
sum of the next two packets, F + G. The content of nodes
2, 3 and 4 is obtained likewise by cyclically shifting the encoding
pattern to the right by 2, 4 and 6 packets, respectively. It is
easy to verify that a data collector can rebuild the file from
any two of the four nodes in the illustrated code. For example, a
data collector which connects to nodes 1 and 2 can reconstruct
the eight data packets by downloading A, B, C and F + G
from node 1, and D, E, F, and H + A from node 2. Then
it can solve for G by subtracting F from F + G, and for H by
subtracting A from H + A.
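The reconstruction property can also be checked mechanically. The following Python sketch is illustrative only: it represents each stored packet as an 8-bit mask over the data packets A to H (bit 0 for A, bit 7 for H), builds the four nodes from the shifting pattern described above, and verifies by Gaussian elimination over GF(2) that the packets held by any two nodes span all eight data packets. The helper names `node_content` and `gf2_rank` are hypothetical and not part of the paper.

```python
from itertools import combinations

NUM_PACKETS = 8  # data packets A..H; packet t is represented by bit t


def node_content(i):
    """Masks of the five packets stored in node i+1 (i = 0, 1, 2, 3)."""
    s = 2 * i  # the encoding pattern is shifted by 2 packets per node
    systematic = [1 << ((s + t) % NUM_PACKETS) for t in range(4)]
    # Skip packet s+4, then store the XOR of packets s+5 and s+6.
    parity = (1 << ((s + 5) % NUM_PACKETS)) | (1 << ((s + 6) % NUM_PACKETS))
    return systematic + [parity]


def gf2_rank(masks):
    """Rank over GF(2) of a list of bit-mask vectors."""
    basis = {}  # leading bit position -> basis vector with that leading bit
    for m in masks:
        while m:
            lead = m.bit_length() - 1
            if lead not in basis:
                basis[lead] = m
                break
            m ^= basis[lead]  # eliminate the leading bit and continue
    return len(basis)


nodes = [node_content(i) for i in range(4)]
ranks = {
    (a + 1, b + 1): gf2_rank(nodes[a] + nodes[b])
    for a, b in combinations(range(4), 2)
}
for pair, rank in ranks.items():
    print(pair, "rank", rank)
```

A rank of 8 for a pair of nodes means that the ten packets they store determine all eight data packets, which is the any-two-of-four property claimed for the example.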

As for the repair process, the illustrated code requires ten
packet transmissions for any two-failure recovery. Suppose that
nodes 1 and 3 fail (see the first diagram in Fig. 1). Newcomers
1 and 3 each first download four packets from the surviving nodes
2 and 4. Then newcomer 1 (resp. newcomer 3) computes the
sum B + C (resp. F + G) and sends it to newcomer 3 (resp.
newcomer 1). A total of ten packets, equal to the number of
arrows (solid and dashed) in the figure,
are transmitted in the network. Similarly, should nodes 2 and 4
fail, the same repair-bandwidth is consumed for regeneration.
Suppose that nodes 1 and 4 fail (see the second diagram in
Fig. 1). Newcomers 1 and 4 each first download four packets
from the surviving nodes 2 and 3. Note that among the four
downloaded packets, newcomer 1 (resp. newcomer 4) receives
one encoded packet F + G (resp. D + E) from node 3 (resp.
node 2). Then newcomer 1 solves for packet B by subtracting
C from B + C, and transmits packet B to newcomer 4. Also,
newcomer 4 solves for packet A and sends it to newcomer 1.
Clearly, a total of ten packet transmissions is sufficient.
Similarly, if any pair of adjacent storage nodes fails, we
can also repair them with ten packet transmissions, using the
symmetry in the encoding for data distribution.
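The second scenario (nodes 1 and 4 fail) can be traced with a few lines of Python. This is only a sketch under stated assumptions: packets are modelled as random 16-bit integers, "+" is XOR, and the exact split of the non-encoded downloads between nodes 2 and 3 is an assumption chosen to be consistent with the description above (two packets from each surviving node).

```python
import random

random.seed(1)
A, B, C, D, E, F, G, H = (random.getrandbits(16) for _ in range(8))

# Surviving nodes 2 and 3, as stored in the illustrated code.
node2 = {"C": C, "D": D, "E": E, "F": F, "H+A": H ^ A}
node3 = {"E": E, "F": F, "G": G, "H": H, "B+C": B ^ C}

transmissions = 0

# Phase 1: each newcomer downloads two packets from each surviving node.
new1 = {
    "C": node2["C"],
    "D": node2["D"],
    "F+G": node3["F"] ^ node3["G"],  # encoded by node 3 before sending
    "B+C": node3["B+C"],
}
new4 = {
    "G": node3["G"],
    "H": node3["H"],
    "D+E": node2["D"] ^ node2["E"],  # encoded by node 2 before sending
    "H+A": node2["H+A"],
}
transmissions += len(new1) + len(new4)

# Phase 2: the newcomers exchange one packet each.
b_from_1 = new1["B+C"] ^ new1["C"]  # newcomer 1 solves for B, sends it to 4
a_from_4 = new4["H+A"] ^ new4["H"]  # newcomer 4 solves for A, sends it to 1
transmissions += 2

# Regenerated contents, in the order in which the original nodes stored them.
repaired1 = [a_from_4, b_from_1, new1["C"], new1["D"], new1["F+G"]]
repaired4 = [new4["G"], new4["H"], a_from_4, b_from_1, new4["D+E"]]
print("packets transmitted:", transmissions)
```

Both newcomers end up with exactly the five packets that nodes 1 and 4 held before the failure, so the repair is exact.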

III. Lower Bound on Repair-Bandwidth for 
Multi-Loss Cooperative Repair 

In this paper, we assume that the storage nodes are symmetrical:
for the storage cost, each node stores $\alpha$ packets, and for
the repair-bandwidth, each newcomer connects to $d$ existing
nodes and downloads $\beta_1$ packets from each of them, and then
sends $\beta_2$ packets to each of the $r - 1$ other newcomers. We
only consider the case $d \geq k$. The repair-bandwidth
per newcomer, denoted by $\gamma$, is defined as the total
number of packets each newcomer receives, and thus is
equal to

$$\gamma = d\beta_1 + (r-1)\beta_2.$$

The aim of this section is to derive a lower bound on $\gamma$.

To formulate the problem, we draw an information flow
graph as in [6]. Given parameters $n$, $k$, $d$, $r$, $\alpha$, $\beta_1$ and $\beta_2$, we
construct an information flow graph $G = (V, E)$ as follows.
The vertices are grouped into stages.

• In stage $-1$, there is only one vertex S, representing the
source node which has the original file.

• In stage 0, there are $n$ vertices $\mathrm{Out}_1, \mathrm{Out}_2, \ldots, \mathrm{Out}_n$,
each of which corresponds to an initial storage node. There
is a directed edge with capacity $\alpha$ from S to each $\mathrm{Out}_i$.

• For $t = 1, 2, 3, \ldots$, suppose $r$ nodes fail in stage $t$.
Let the indices of these $r$ storage nodes be $S_t = \{j_1, j_2, \ldots, j_r\}$.
For each $i \in S_t$, we put three vertices $\mathrm{In}_i$,
$\mathrm{Mid}_i$ and $\mathrm{Out}_i$ in stage $t$. There are $d$ directed edges,
each with capacity $\beta_1$, from $d$ "out" vertices in previous stages
to each $\mathrm{In}_i$. For each $i \in S_t$, we put a directed edge
from $\mathrm{In}_i$ to $\mathrm{Mid}_i$ with infinite capacity, and a directed
edge from $\mathrm{Mid}_i$ to $\mathrm{Out}_i$ with capacity $\alpha$. For each pair
of distinct indices $i, j \in S_t$, we draw a directed edge from
$\mathrm{In}_i$ to $\mathrm{Mid}_j$ with capacity $\beta_2$. The edges with capacity $\beta_1$
represent the data transferred from existing storage nodes
to newcomers, and the edges with capacity $\beta_2$ represent
the data exchanged among the newcomers.

• For a data collector that shows up after $s$ repair processes
have taken place, we put a vertex DC in stage $s$ and
connect it with $k$ "out" vertices with distinct indices in
stage $s$ or earlier. The capacities of these $k$ edges are set
to infinity.

An example of information flow graph for $n = 4$ storage nodes
and $d = r = k = 2$ is shown in Fig. 2.

Fig. 2. An example of information flow graph.

Deriving a lower bound on $d\beta_1 + (r-1)\beta_2$ for a file of fixed
size $B$ is equivalent to deriving an upper bound on $B$ for given
capacities $\beta_1$ and $\beta_2$ in the information flow graph. We can
therefore apply the celebrated max-flow theorem in [10], which says that the
size of the data file $B$ cannot be larger than the max-flow from
S to any data collector (DC). The max-flow is the maximal
value over all feasible flows from S to DC. Here, a flow from S
to DC, called an (S, DC)-flow, is a mapping $F$ from the set
of edges to the set of non-negative real numbers, satisfying (i)
for every edge $e$, $F(e)$ does not exceed the capacity of $e$, and (ii)
for any vertex $v$ except the source vertex S and the terminal
vertex DC, the sum of $F(e)$ over edges $e$ terminating at $v$ is
equal to the sum of $F(e)$ over edges $e$ going out from $v$.

The value of an (S, DC)-flow $F$ is defined as

$$\sum_{e = (u, \mathrm{DC}) \in E} F(e).$$
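To make the flow formulation concrete, the sketch below builds a small information flow graph and computes the max-flow to one particular data collector with a textbook Edmonds-Karp routine. Everything specific here is an assumption made for illustration: the vertex names, the choice of a single repair stage in which nodes 1 and 2 fail and are repaired from nodes 3 and 4, the data collector connecting to the two repaired nodes, and the capacities $\alpha = 5$, $\beta_1 = 2$, $\beta_2 = 1$ (which correspond to the example with $n = 4$, $d = k = r = 2$).

```python
from collections import defaultdict, deque

INF = float("inf")


def max_flow(cap, s, t):
    """Edmonds-Karp max-flow on a dict-of-dicts capacity map."""
    res = defaultdict(lambda: defaultdict(int))  # residual capacities
    for u, nbrs in cap.items():
        for v, c in nbrs.items():
            res[u][v] += c
            res[v][u] += 0  # make sure the reverse edge exists
    flow = 0
    while True:
        parent = {s: None}
        queue = deque([s])
        while queue and t not in parent:  # BFS for a shortest augmenting path
            u = queue.popleft()
            for v, c in res[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:
            return flow
        bottleneck, v = INF, t
        while parent[v] is not None:
            bottleneck = min(bottleneck, res[parent[v]][v])
            v = parent[v]
        v = t
        while parent[v] is not None:
            u = parent[v]
            res[u][v] -= bottleneck
            res[v][u] += bottleneck
            v = u
        flow += bottleneck


def build_graph(alpha, beta1, beta2):
    """One repair stage: nodes 1, 2 fail and are repaired from nodes 3, 4."""
    cap = defaultdict(dict)
    for i in (1, 2, 3, 4):
        cap["S"][f"Out{i}@0"] = alpha
    failed, helpers = (1, 2), (3, 4)
    for i in failed:
        for h in helpers:
            cap[f"Out{h}@0"][f"In{i}@1"] = beta1
        cap[f"In{i}@1"][f"Mid{i}@1"] = INF
        cap[f"Mid{i}@1"][f"Out{i}@1"] = alpha
        for j in failed:
            if j != i:
                cap[f"In{i}@1"][f"Mid{j}@1"] = beta2
        cap[f"Out{i}@1"]["DC"] = INF
    return cap


print(max_flow(build_graph(alpha=5, beta1=2, beta2=1), "S", "DC"))
```

For this data collector the bottleneck is the set of four $\beta_1$ edges entering the two "in" vertices, so halving $\beta_1$ halves the max-flow and a file of eight packets could no longer be supported.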

From the max-flow-min-cut theorem, we can upper bound
the value of a flow by the capacity of a cut. For a given
data collector DC, an (S, DC)-cut is a partition $(\mathcal{U}, \bar{\mathcal{U}})$ of the
vertices in the information flow graph, such that $S \in \bar{\mathcal{U}}$ and
$\mathrm{DC} \in \mathcal{U}$, where $\bar{\mathcal{U}}$ stands for the complement of $\mathcal{U}$ in the vertex
set $V$. The capacity of an (S, DC)-cut is defined as the sum
of capacities of the edges from vertices in $\bar{\mathcal{U}}$ to vertices in $\mathcal{U}$.
Next, we will use the fact that the value of any (S, DC)-flow is
less than or equal to the capacity of any (S, DC)-cut, together
with the max-flow theorem in [10], to prove the following
theorem.

Theorem 1. If $d \geq k$, the repair-bandwidth $d\beta_1 + (r-1)\beta_2$
is lower bounded by

$$\frac{B(2d + r - 1)}{k(2d + r - k)}, \tag{1}$$

and this lower bound can be met only when

$$(\beta_1, \beta_2) = \left(\frac{2B}{k(2d + r - k)}, \frac{B}{k(2d + r - k)}\right). \tag{2}$$

Proof: Consider a data collector which downloads data
from $k$ out of $n$ storage nodes. By re-labeling the storage
nodes, we can assume without loss of generality that the
corresponding $k$ "out" vertices are $\mathrm{Out}_1, \mathrm{Out}_2, \ldots, \mathrm{Out}_k$.

Fig. 3. A sample cut through the information flow graph.

Suppose that these $k$ "out" vertices belong to stages 1 to $s$,
and for $v = 1, 2, \ldots, s$, there are $\ell_v$ "out" nodes in $\mathcal{U}$ in
stage $v$. By vertex re-labeling, we can assume without loss
of generality that $\mathrm{Out}_1, \mathrm{Out}_2, \ldots, \mathrm{Out}_{\ell_1}$ belong to stage 1,
$\mathrm{Out}_{\ell_1+1}, \ldots, \mathrm{Out}_{\ell_1+\ell_2}$ belong to stage 2, and so
on. For notational convenience, we let $\ell_0 = 0$. Consider the
(S, DC)-cut with $\mathcal{U}$ consisting of the vertices

$$\bigcup_{j = \ell_0 + \ell_1 + \cdots + \ell_{v-1} + 1}^{\ell_0 + \ell_1 + \cdots + \ell_v} \{\mathrm{In}_j, \mathrm{Mid}_j, \mathrm{Out}_j\}$$

in stage $v$, for $v = 1, 2, \ldots, s$, and DC. We say that the cut
thus defined is of type $(\ell_1, \ell_2, \ldots, \ell_s)$. An example of a cut of
type (2, 1, 2) is shown in Fig. 3.

We claim that the capacity of an (S, DC)-cut of type
$(\ell_1, \ell_2, \ldots, \ell_s)$ can be as small as

$$\sum_{i=1}^{s} \left[ \ell_i \Big(d - \sum_{j=1}^{i-1} \ell_j\Big) \beta_1 + \ell_i (r - \ell_i) \beta_2 \right]. \tag{3}$$

There are $\ell_1 d$ edges, each with capacity $\beta_1$, terminating at
$\mathrm{In}_1, \mathrm{In}_2, \ldots, \mathrm{In}_{\ell_1}$ in stage 1. Also, there are $\ell_1(r - \ell_1)$ edges,
each with capacity $\beta_2$, terminating at $\mathrm{Mid}_1, \mathrm{Mid}_2, \ldots, \mathrm{Mid}_{\ell_1}$.
Hence, a total of $\ell_1 d\beta_1 + \ell_1(r - \ell_1)\beta_2$ is contributed to the
summation in (3). This is the summand corresponding to $i = 1$
in (3).

For the second group of $\ell_2$ storage nodes in stage 2,
each "in" vertex may have $\ell_1$ of its $d$ incoming links originating
from the first group of storage nodes; these links lie inside $\mathcal{U}$
and are not counted in the capacity of the cut. The sum
of capacities of edges terminating at the "in" vertices in $\mathcal{U}$
in stage 2 could be as small as $\ell_2(d - \ell_1)\beta_1$. Together with
the sum of capacities of the edges terminating at the "mid"
vertices, a total of $\ell_2(d - \ell_1)\beta_1 + \ell_2(r - \ell_2)\beta_2$ is contributed
to (3). This is the second summand in (3). The rest of the
summands can be derived similarly. This finishes the proof of
the claim.

For a data file of size $B$, we should be able to construct
a flow of value at least $B$. Hence $B$ is less than or equal
to (3) for all types $(\ell_1, \ell_2, \ldots, \ell_s)$ with $0 \leq \ell_v \leq r$ for all
$v = 1, 2, \ldots, s$, and $\ell_1 + \cdots + \ell_s = k$. After some algebraic
manipulations, we have the following upper bound on $B$,

$$B \leq dk\beta_1 + rk\beta_2 - \beta_1 \sum_{i=1}^{s} \sum_{j=1}^{i-1} \ell_i \ell_j - \beta_2 \sum_{i=1}^{s} \ell_i^2. \tag{4}$$



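We note that if we substitute $\beta_1$ and $\beta_2$ by $2B/(k(2d+r-k))$
and $B/(k(2d+r-k))$ respectively, then we have equality
in (4), because $2\sum_{i=1}^{s}\sum_{j=1}^{i-1} \ell_i\ell_j + \sum_{i=1}^{s} \ell_i^2 = (\ell_1 + \cdots + \ell_s)^2 = k^2$.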
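This equality can be confirmed numerically. The sketch below is an illustration only: it evaluates the cut capacity in (3) with exact rational arithmetic for every type of a given length and checks that, at the operating point (2), each of them equals $B$. The function names and the sample parameters ($B = 8$, $d = 4$, $k = 3$, $r = 2$, $s = 3$) are arbitrary choices, not values from the paper.

```python
from fractions import Fraction
from itertools import product


def cut_capacity(ells, d, r, beta1, beta2):
    """Capacity in (3) of a cut of type ells = (l_1, ..., l_s)."""
    total, earlier = 0, 0
    for l in ells:
        # l "in" vertices, each losing `earlier` links that stay inside the cut,
        # plus l * (r - l) exchange edges coming from the other newcomers.
        total += l * (d - earlier) * beta1 + l * (r - l) * beta2
        earlier += l
    return total


def all_types(k, r, s):
    """All types (l_1, ..., l_s) with 0 <= l_v <= r and l_1 + ... + l_s = k."""
    return [t for t in product(range(r + 1), repeat=s) if sum(t) == k]


def capacities_at_optimum(B, d, k, r, s):
    """Cut capacities of all types when (beta1, beta2) is chosen as in (2)."""
    denom = k * (2 * d + r - k)
    beta1, beta2 = Fraction(2 * B, denom), Fraction(B, denom)
    return {t: cut_capacity(t, d, r, beta1, beta2) for t in all_types(k, r, s)}


caps = capacities_at_optimum(B=8, d=4, k=3, r=2, s=3)
print(set(caps.values()))
```

Since every type gives the same capacity at this point, no single cut is slack there, which is consistent with (2) being the only point at which the bound can be met.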

We finish the proof by considering two cases.

Case 1: $k \leq r$. Consider the cut of type $(1, 1, \ldots, 1)$.
From (4), we obtain

$$B \leq \left(dk - \frac{k(k-1)}{2}\right)\beta_1 + (rk - k)\beta_2. \tag{5}$$

From the cut of type $(k, 0, 0, \ldots, 0)$, we have the following
constraint,

$$B \leq dk\beta_1 + (r - k)k\beta_2. \tag{6}$$

If we multiply (5) by $2d$, multiply (6) by $r - 1$, and add the
two resulting inequalities, we get

$$(2d + r - 1)B \leq k(2d + r - k)\big(d\beta_1 + (r - 1)\beta_2\big).$$

This proves the lower bound in (1) in Case 1.
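For readers who wish to check the combination step, the coefficients of $\beta_1$ and $\beta_2$ obtained from $2d$ times (5) plus $(r-1)$ times (6) expand as follows. This only spells out the algebra and adds no new assumption.

```latex
\begin{align*}
2d\left(dk - \tfrac{k(k-1)}{2}\right) + (r-1)\,dk
  &= dk\,\bigl(2d - (k-1) + (r-1)\bigr) = dk\,(2d + r - k), \\
2d\,(rk - k) + (r-1)(r-k)\,k
  &= k(r-1)\,\bigl(2d + (r-k)\bigr) = k(r-1)\,(2d + r - k).
\end{align*}
```

Factoring out $k(2d + r - k)$ from both coefficients leaves $d\beta_1 + (r-1)\beta_2$, which is the repair-bandwidth per newcomer.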

To see that the lower bound can be met only when $\beta_1$ and
$\beta_2$ are specified as in the theorem, we notice that (5) and (6)
define an unbounded polyhedral region in the $\beta_1$-$\beta_2$ plane,
with (2) as a vertex point. If we want to minimize the objective
function $d\beta_1 + (r - 1)\beta_2$ over all points $(\beta_1, \beta_2)$ in this region,
the optimal point is precisely the point given in (2).

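Case 2: $k > r$. Consider a cut of type $(r, r, \ldots, r, b)$, with
$a$ entries equal to $r$, where $a = \lfloor k/r \rfloor$ and $b = k - ra$.
The upper bound on $B$ in (4) becomes

$$B \leq \left(dk - \frac{r^2 a(a-1)}{2} - abr\right)\beta_1 + (rk - ar^2 - b^2)\beta_2. \tag{7}$$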



Together with the constraint obtained from a cut of type
$(1, 1, \ldots, 1)$, we set up a linear program and minimize
$d\beta_1 + (r - 1)\beta_2$ over all $(\beta_1, \beta_2)$ satisfying the inequalities in (5)
and (7).

Let $L_1$ be the straight line in the $\beta_1$-$\beta_2$ plane obtained by setting
the inequality in (5) to equality. Let $L_2$ be the straight line
consisting of the points $(\beta_1, \beta_2)$ satisfying (7) with equality. We
can verify that the intersection of $L_1$ and $L_2$ is the point in (2).

We investigate the slopes of $L_1$ and $L_2$. The slope of $L_1$ is
equal to $-(d - (k-1)/2)/(r-1)$, which is larger than the
slope of the objective function, $-d/(r-1)$. For the slope of
$L_2$, we first check that

$$\begin{aligned}
& dk - \frac{r^2 a(a-1)}{2} - abr > db \\
\Leftarrow\ & dk - r^2 a(a-1) - abr > db \\
\Leftrightarrow\ & d > r(a-1) + b \\
\Leftrightarrow\ & d > k - r.
\end{aligned}$$

We have used several times that $k = ar + b$ in the above
derivation. The last line holds by the assumptions $d \geq k$ and
$r \geq 2$. Therefore,

$$-\frac{dk - r^2 a(a-1)/2 - abr}{rk - ar^2 - b^2} < -\frac{db}{rk - ar^2 - b^2} \leq -\frac{d}{r-1}.$$



The slope of $L_2$ is strictly less than the slope of the objective
function. Thus, the optimal point of the linear programming
problem is the vertex in (2). This completes the proof of
Case 2. ■
Note: Theorem 1 is obtained independently in
We can now show that the regenerating code discussed in
Section II is optimal, in the sense that given the parameters
$B$, $d$, $k$ and $r$, the repair-bandwidth matches the lower bound
in Theorem 1. We have $B = 8$ and $d = k = r = 2$ in the
example. From Theorem 1, the repair-bandwidth cannot be
less than $8(2 \cdot 2 + 2 - 1)/(2(2 \cdot 2 + 2 - 2)) = 5$. We have shown in Section II that
the repair process requires exactly 5 packet transmissions per
failed node, and therefore matches the optimal value.

For non-cooperative or one-by-one repair, it is proved
in [3] that the minimum repair-bandwidth per failed node
is $2dB/(k(2d + 1 - k))$, which turns out to be the same
as the expression in (1) with $r$ set to 1. If we apply
a non-cooperative regenerating code to a distributed storage
system with parameters as in Section II, the minimum
repair-bandwidth is $2 \cdot 2 \cdot 8/(2(2 \cdot 2 + 1 - 2)) \approx 5.333$. From this simple example,
we can see that the repair-bandwidth can be further reduced if
some exchange of data among the newcomers is allowed.
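The two figures can be reproduced with exact arithmetic. The short sketch below simply evaluates the two closed-form expressions quoted above; the function names are hypothetical.

```python
from fractions import Fraction


def cooperative_bound(B, d, k, r):
    """Lower bound of Theorem 1 on the repair-bandwidth per newcomer."""
    return Fraction(B * (2 * d + r - 1), k * (2 * d + r - k))


def one_by_one_bound(B, d, k):
    """Minimum repair-bandwidth per failed node without cooperation."""
    return Fraction(2 * d * B, k * (2 * d + 1 - k))


coop = cooperative_bound(B=8, d=2, k=2, r=2)
solo = one_by_one_bound(B=8, d=2, k=2)
print(coop, solo, float(solo))
```

Setting `r=1` in `cooperative_bound` gives the same value as `one_by_one_bound`, in line with the remark that the two expressions coincide when only one node is repaired at a time.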

The lower bound on the repair-bandwidth in Theorem 1 can in fact
be achieved via random linear coding over a sufficiently large field.
The tightness of the lower bound is established in [11], by
showing the existence of MBCR codes which match the lower
bound. Thus, the minimum repair-bandwidth for MBCR is
indeed equal to $B(2d + r - 1)/(k(2d + r - k))$.

IV. An Explicit Construction of a Family of 
Optimal MBCR Codes 

We construct in this section a family of exact MBCR codes
with parameters $d = k$ and $n = d + r$. In fact, the illustrated
code in Section II is a special case in this family.

The whole file is first divided into stripes. Each stripe
consists of $B = k(2d + r - k) = kn$ data packets,
considered as elements in $GF(q)$, where $q$ is a prime
power. In each stripe, let the $kn$ data packets be $x_0$,
$x_1, \ldots, x_{kn-1}$. We divide them into $n$ groups. The first group
consists of $x_0, x_1, \ldots, x_{k-1}$, the second group consists of
$x_k, x_{k+1}, \ldots, x_{2k-1}$, and so on. For notational convenience,
we let $\mathbf{x}_j = [x_{(j-1)k}\ x_{(j-1)k+1}\ \ldots\ x_{(j-1)k+k-1}]$ be the
vector of the data packets in the $j$th group ($1 \leq j \leq n$).

For $i = 1, 2, \ldots, n$, we construct the content of node $i$ as
follows. We first put the $k$ data packets in the $i$-th group
into node $i$ and then the $n - 1$ parity-check packets

$$\mathbf{x}_{i \oplus 1} \cdot \mathbf{v}_1,\ \mathbf{x}_{i \oplus 2} \cdot \mathbf{v}_2,\ \ldots,\ \mathbf{x}_{i \oplus (n-1)} \cdot \mathbf{v}_{n-1}$$

into node $i$, where "$\cdot$" is the dot product of vectors and $\oplus$ is
modulo-$n$ addition defined by

$$x \oplus y := \begin{cases} x + y & \text{if } x + y \leq n, \\ x + y - n & \text{if } x + y > n. \end{cases}$$

Here $\mathbf{v}_j$ ($j = 1, 2, \ldots, n-1$) are the column vectors of a $k \times (n-1)$
generating matrix, $G = [\mathbf{v}_1\ \mathbf{v}_2\ \ldots\ \mathbf{v}_{n-1}]$, of a maximal-distance
separable (MDS) code over $GF(q)$ of length $n - 1$
and dimension $k$. By the defining property of an MDS code, any
$k$ columns of $G$ are linearly independent over $GF(q)$.
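The index pattern of the parity-check packets is easy to tabulate. The following sketch lists, for each node, which group vector is combined with which column of $G$; the pair `(g, m)` stands for the packet $\mathbf{x}_g \cdot \mathbf{v}_m$. The function names and the choice $n = 5$ in the loop are for illustration only.

```python
def oplus(x, y, n):
    """Modulo-n addition with values in 1..n, as defined in the text."""
    return x + y if x + y <= n else x + y - n


def node_layout(i, n):
    """Parity packets of node i as (group index, column index) pairs.

    The pair (g, m) stands for the packet x_g . v_m.
    """
    return [(oplus(i, m, n), m) for m in range(1, n)]


for i in range(1, 6):
    print(i, node_layout(i, n=5))
```

Each node holds exactly one parity-check packet from every group other than its own, and each column of $G$ is used exactly once per node.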



As for the file reconstruction process, suppose without
loss of generality that a data collector connects to nodes 1,
$2, \ldots, k$. The systematic packets $x_0, x_1, \ldots, x_{k^2-1}$ in the first
$k$ groups can be downloaded directly, because they are stored
in node 1 to node $k$ uncoded. The $j$th group of data packets
($j > k$), i.e., the components of the vector $\mathbf{x}_j$, can be reconstructed
from $\mathbf{x}_j \cdot \mathbf{v}_{j-1}, \mathbf{x}_j \cdot \mathbf{v}_{j-2}, \ldots, \mathbf{x}_j \cdot \mathbf{v}_{j-k}$, by the MDS property.
A data collector connecting to any other $k$ storage nodes can
decode similarly.

As for the cooperative repair process, suppose without
loss of generality that nodes $k + 1$ to $n$ fail at the same time.
The repair process proceeds as follows.
Step 1: For $i = 1, 2, \ldots, k$, node $i$ computes $\mathbf{x}_i \cdot \mathbf{v}_{n+i-j}$ and
sends it to newcomer $j$, for $j = k + 1, k + 2, \ldots, n$.
Step 2: For $j = k + 1, k + 2, \ldots, n$, newcomer $j$ downloads the
$k$ packets $\mathbf{x}_j \cdot \mathbf{v}_{j-1}, \mathbf{x}_j \cdot \mathbf{v}_{j-2}, \ldots, \mathbf{x}_j \cdot \mathbf{v}_{j-k}$ from
nodes 1 to $k$.
Step 3: For $j = k + 1, k + 2, \ldots, n$, newcomer $j$ can solve
for the systematic packets in $\mathbf{x}_j$. Then node $j$ sends
$\mathbf{x}_j \cdot \mathbf{v}_{j+i-1}$ to node $n - i + 1$, for $i = 1, 2, \ldots, n - j$, and
sends $\mathbf{x}_j \cdot \mathbf{v}_i$ to node $j - i$, for $i = 1, 2, \ldots, j - k - 1$.

In steps 1 and 2, a total of $2k(n - k) = 2kr$ packets are
transmitted. In step 3, each newcomer transmits $r - 1$ packets.
The total number of packets required in the whole repair
process is $2kr + r(r - 1) = r(2d + r - 1)$. The number of
packets per failed node is therefore $2d + r - 1$. According to
Theorem 1, the repair-bandwidth is no less than

$$B \cdot \frac{2d + r - 1}{k(2d + r - k)} = 2d + r - 1.$$

Thus, this regenerating code is optimal.
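The packet count and the index bookkeeping of the three steps can be checked symbolically, without any finite-field arithmetic. In the sketch below, a transmission is recorded as a tuple `(sender, receiver, g, m)` meaning that the packet $\mathbf{x}_g \cdot \mathbf{v}_m$ is sent; the function names are hypothetical, and the column indices used in Step 3 are those implied by the storage pattern $\mathbf{x}_{i \oplus m} \cdot \mathbf{v}_m$ of node $i$.

```python
def oplus(x, y, n):
    """Modulo-n addition with values in 1..n."""
    return x + y if x + y <= n else x + y - n


def repair_schedule(n, k):
    """Transmissions (sender, receiver, g, m) when nodes k+1..n fail."""
    sched = []
    newcomers = range(k + 1, n + 1)
    for i in range(1, k + 1):            # Step 1: parity built from x_i
        for j in newcomers:
            sched.append((i, j, i, n + i - j))
    for j in newcomers:                  # Step 2: stored parity of x_j
        for i in range(1, k + 1):
            sched.append((i, j, j, j - i))
    for j in newcomers:                  # Step 3: exchange among newcomers
        for i in range(1, n - j + 1):
            sched.append((j, n - i + 1, j, j + i - 1))
        for i in range(1, j - k):
            sched.append((j, j - i, j, i))
    return sched


def parity_received(sched, j):
    """Parity packets of other groups delivered to newcomer j."""
    return {(g, m) for (_, dst, g, m) in sched if dst == j and g != j}


def parity_needed(j, n):
    """Parity packets that node j must store according to the construction."""
    return {(oplus(j, m, n), m) for m in range(1, n)}


sched = repair_schedule(n=5, k=3)
print(len(sched), parity_received(sched, 4) == parity_needed(4, 5))
```

The packets of group $j$ received in Step 2 let newcomer $j$ recover its own systematic part, while the packets counted by `parity_received` are exactly its parity-check part, so nothing beyond the listed transmissions is required.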

Remark: If $n = q + 2$ for some prime power $q$, we can use
an extended Reed-Solomon (RS) code of length $q + 1$ in the
construction. The alphabet size could be as small as $n - 2$.
We refer the reader to [12] for the construction of extended
RS codes.

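Example: An example for $n = 5$, $d = k = 3$ and $r = 2$ is
shown in Fig. 4. A stripe of file data is divided into 15 packets
$x_0, x_1, \ldots, x_{14}$. Let $q = 2$ and $G$ be the generating matrix

$$G = [\mathbf{v}_1\ \mathbf{v}_2\ \mathbf{v}_3\ \mathbf{v}_4] = \begin{bmatrix} 1 & 1 & 0 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 0 & 0 & 1 \end{bmatrix}$$

of a triply-extended Reed-Solomon code over $GF(2)$ [12]. The
$i$th row of the array in Fig. 4 indicates the content of node $i$.
For example, node 4 stores six systematic packets, namely $x_9$, $x_{10}$,
$x_{11}$ in $\mathbf{x}_4$, $\mathbf{x}_1 \cdot \mathbf{v}_2 = x_0$, $\mathbf{x}_2 \cdot \mathbf{v}_3 = x_4$ and $\mathbf{x}_3 \cdot \mathbf{v}_4 = x_8$, and one
parity-check packet $\mathbf{x}_5 \cdot \mathbf{v}_1 = x_{12} + x_{13} + x_{14}$.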
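The content of each node in this example, and the fact that any three nodes suffice to rebuild the stripe, can be checked with a short script. As a sketch under stated assumptions, each stored packet is represented by a 15-bit mask in which bit $t$ stands for the data packet $x_t$, and decodability of a set of nodes is tested by computing the rank of their masks over $GF(2)$. The helper names are hypothetical.

```python
from itertools import combinations

N, K = 5, 3
# Columns v_1..v_4 of the generating matrix G of the example, over GF(2).
V = {1: (1, 1, 1), 2: (1, 0, 0), 3: (0, 1, 0), 4: (0, 0, 1)}


def oplus(x, y, n=N):
    """Modulo-n addition with values in 1..n."""
    return x + y if x + y <= n else x + y - n


def dot_mask(g, m):
    """Bit mask of the packet x_g . v_m; bit t stands for data packet x_t."""
    return sum(1 << ((g - 1) * K + t) for t in range(K) if V[m][t])


def node_masks(i):
    """Masks of the k systematic and n-1 parity-check packets of node i."""
    systematic = [1 << ((i - 1) * K + t) for t in range(K)]
    parity = [dot_mask(oplus(i, m), m) for m in range(1, N)]
    return systematic + parity


def gf2_rank(masks):
    """Rank over GF(2) of a list of bit-mask vectors."""
    basis = {}  # leading bit position -> basis vector
    for m in masks:
        while m:
            lead = m.bit_length() - 1
            if lead not in basis:
                basis[lead] = m
                break
            m ^= basis[lead]
    return len(basis)


ranks = {
    trio: gf2_rank([m for i in trio for m in node_masks(i)])
    for trio in combinations(range(1, N + 1), K)
}
print(sorted(set(ranks.values())))
```

A rank of 15 for a triple of nodes means that the 21 packets they store determine all 15 data packets of the stripe.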

Suppose that nodes 4 and 5 fail. In the first step, node 1
sends $\mathbf{x}_1 \cdot \mathbf{v}_2$ and $\mathbf{x}_1 \cdot \mathbf{v}_1$ to newcomers 4 and 5 respectively.
Similarly, node 2 sends $\mathbf{x}_2 \cdot \mathbf{v}_3$ and $\mathbf{x}_2 \cdot \mathbf{v}_2$, and node 3 sends
$\mathbf{x}_3 \cdot \mathbf{v}_4$ and $\mathbf{x}_3 \cdot \mathbf{v}_3$. In the second step, node 1 transmits $\mathbf{x}_4 \cdot \mathbf{v}_3$
and $\mathbf{x}_5 \cdot \mathbf{v}_4$ to newcomers 4 and 5 respectively. Likewise, node
2 transmits $\mathbf{x}_4 \cdot \mathbf{v}_2$ and $\mathbf{x}_5 \cdot \mathbf{v}_3$, and node 3 transmits $\mathbf{x}_4 \cdot \mathbf{v}_1$ and
$\mathbf{x}_5 \cdot \mathbf{v}_2$. In the third step, newcomer 4 reconstructs $\mathbf{x}_4$, and sends
$\mathbf{x}_4 \cdot \mathbf{v}_4$ to newcomer 5. Also, newcomer 5 reconstructs $\mathbf{x}_5$, and
sends $\mathbf{x}_5 \cdot \mathbf{v}_1$ to newcomer 4. Lastly, the lost packets in nodes 4
and 5 are regenerated in newcomers 4 and 5. The total number
of packet transmissions in the whole repair process is equal to
14. The repair-bandwidth per failed node is 7. It matches the
theoretical lower bound $15(2 \cdot 3 + 2 - 1)/(3(2 \cdot 3 + 2 - 3)) = 7$.

Fig. 4. An example of an exact MBCR code for $n = 5$, $d = k = 3$ and $r = 2$.

V. Conclusion 

We give a construction of a family of exact and optimal
MBCR codes for $d = k$ and $n = d + r$. The constructed
regenerating code has the advantage of being a systematic
code. For example, if we want to look at the content of one
particular packet, we only need to contact the node which
has a copy of this packet and download the packet directly.
Another advantage of this construction is that the required
finite field size grows only linearly with the number
of storage nodes.